

Section: New Results

Efficient Floating-Point Arithmetic and Applications

Participants : Nicolas Brisebarre, Claude-Pierre Jeannerod, Mioara Joldeş, Jingyan Jourdan-Lu, Vincent Lefèvre, Nicolas Louvet, Érik Martin-Dorel, Christophe Mouilleron, Jean-Michel Muller, Adrien Panhaleux.

Correctly Rounded Sums

P. Kornerup (Odense Univ., Denmark), V. Lefèvre, N. Louvet and J.-M. Muller have studied some of the basic blocks needed in the design of floating-point summation algorithms. In particular, in radix-2 floating-point arithmetic, they have shown that, among the algorithms that use only floating-point additions/subtractions and no comparisons, the 2Sum algorithm introduced by Knuth is minimal, both in the number of operations and in the depth of the dependency graph. They have investigated the possible use of another algorithm, Dekker's Fast2Sum algorithm, in radix-10 arithmetic. They have also proven that, under reasonable conditions, no algorithm performing only round-to-nearest additions/subtractions can compute the round-to-nearest sum of three or more floating-point numbers. Starting from an algorithm due to Boldo and Melquiond, they have also presented new results about the computation of the correctly-rounded sum of three floating-point numbers [21].
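
For reference, here is a minimal C sketch of the two classical error-free addition blocks mentioned above, in their textbook formulation (binary64, round-to-nearest assumed; this is not code taken from [21]):

    /* Knuth's 2Sum: 6 floating-point operations, no comparison;
       returns s = RN(a+b) and e such that a + b = s + e exactly. */
    static void two_sum(double a, double b, double *s, double *e) {
        *s = a + b;
        double ap = *s - b;           /* recovers a, up to rounding */
        double bp = *s - ap;          /* recovers b, up to rounding */
        *e = (a - ap) + (b - bp);     /* exact rounding error of a + b */
    }

    /* Dekker's Fast2Sum: only 3 operations, but requires |a| >= |b|. */
    static void fast_two_sum(double a, double b, double *s, double *e) {
        *s = a + b;
        *e = b - (*s - a);
    }

The minimality result quoted above says precisely that, without comparisons, neither the six operations of two_sum nor the depth of its dependency graph can be improved.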

Error of an FMA

The fused multiply-add (FMA) instruction, specified by the IEEE 754-2008 Standard for Floating-Point Arithmetic, eases some calculations, and is already available on some current processors such as the PowerPC or the Itanium. S. Boldo (EPI Proval) and J.-M. Muller first extended an earlier work on the computation of the exact error of an FMA (by giving more general conditions and providing a formal proof). Then, they presented a new algorithm that computes an approximation to the error of an FMA, and provided error bounds and a formal proof for that algorithm [16].
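
As background, the exact error of an FMA is in general not representable by a single floating-point number, but it can be expressed as the unevaluated sum of two. The following C sketch follows the classical exact-error algorithm from the literature that [16] extends (radix 2, round-to-nearest, no underflow/overflow assumed; it reuses the two_sum and fast_two_sum blocks sketched above and is not necessarily the exact presentation of [16]):

    #include <math.h>

    /* Computes r1 = RN(ax + y) together with r2, r3 such that
       ax + y = r1 + r2 + r3 exactly. */
    static void err_fma(double a, double x, double y,
                        double *r1, double *r2, double *r3) {
        double u1, u2, a1, a2, b1, b2, g;
        *r1 = fma(a, x, y);
        u1 = a * x;
        u2 = fma(a, x, -u1);        /* exact: ax = u1 + u2 */
        two_sum(y, u2, &a1, &a2);
        two_sum(u1, a1, &b1, &b2);
        g = (b1 - *r1) + b2;        /* shown exact in the analysis */
        fast_two_sum(g, a2, r2, r3);
    }

The new algorithm of [16] instead returns a single floating-point approximation to this error (here, to r2 + r3), with proven error bounds.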

Accurate computation of ad − bc with an FMA

C.-P. Jeannerod, N. Louvet, and J.-M. Muller have provided in [60] a detailed rounding error analysis of Kahan's FMA-based algorithm for the computation of expressions of the form ad − bc. They showed that Kahan's algorithm is always highly accurate and, under mild assumptions on the radix and the precision, gave an optimal bound on the absolute error and an asymptotically optimal bound on the relative error. They also studied how the relative error varies as a function of the relative order of magnitude of the two products ad and bc. Finally, they investigated whether the error bounds can be improved in special cases like sums of squares and discriminants.
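
Kahan's algorithm itself is short; here is a C transcription (binary round-to-nearest arithmetic assumed; the function name is ours):

    #include <math.h>

    /* Kahan's FMA-based evaluation of ad - bc, as analyzed in [60]. */
    static double kahan_det(double a, double b, double c, double d) {
        double w = b * c;           /* w = RN(bc) */
        double e = fma(-b, c, w);   /* e = w - bc, exact thanks to the FMA */
        double f = fma(a, d, -w);   /* f = RN(ad - w) */
        return f + e;               /* RN(f + e) approximates ad - bc */
    }

The key point is that the FMA makes the multiplication error e = w − bc exact, which keeps the final result accurate even when ad and bc nearly cancel.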

Performing Arithmetic Operations on Round-to-Nearest Representations

During any composite computation, there is a constant need for rounding intermediate results before they can participate in further processing. Recently, a class of number representations denoted RN-Codings was introduced, allowing an unbiased round-to-nearest to take place by simple truncation, with the property that problems with double roundings are avoided. P. Kornerup (Odense Univ., Denmark), J.-M. Muller and A. Panhaleux first investigate a particular encoding of the binary representation. This encoding is generalized to any radix and digit set; however, radix complement representations for even values of the radix turn out to be particularly feasible. The encoding is essentially an ordinary radix complement representation with an appended round bit, but it still allows rounding to nearest by truncation, and thus avoids problems with double roundings. Conversions from radix complement to these round-to-nearest representations can be performed in constant time, whereas conversions the other way, in general, take at least logarithmic time. Not only is rounding to nearest a constant-time operation, but so is sign inversion, both of which are at best log-time operations on ordinary two's complement representations. Addition and multiplication on such fixed-point representations are analyzed and defined in such a way that rounding information can be carried along in a meaningful way, at minimal cost. The analysis is carried through for a compact (canonical) encoding using two's complement representation, supplied with a round bit. Based on the fixed-point encoding, it is shown that floating-point representations can be defined, and a sketch of the implementation of an FPU is presented [22].
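
To illustrate the double-rounding problem that these representations avoid, here is a small self-contained C example (our own, not from [22]): rounding 0.1011 in binary (= 11/16) first to two fractional bits and then to one yields 1, whereas direct rounding yields the correct value 1/2.

    #include <stdio.h>

    /* Round-to-nearest-even of an integer v of fixed-point weight
       1/2^f, dropping its `drop` low-order bits. */
    static long rne(long v, int drop) {
        long half = 1L << (drop - 1);
        long q = v >> drop, rem = v & ((1L << drop) - 1);
        if (rem > half || (rem == half && (q & 1)))
            q++;
        return q;
    }

    int main(void) {
        long v = 11;                      /* 0.1011 in binary = 11/16 */
        long direct = rne(v, 3);          /* straight to 1 fractional bit */
        long twice  = rne(rne(v, 2), 1);  /* via 2 fractional bits first */
        printf("direct: %ld/2, double rounding: %ld/2\n", direct, twice);
        return 0;                         /* prints 1/2 versus 2/2 */
    }

An RN-coding keeps the half-unit information in the explicit round bit, so the second rounding remains a plain truncation and cannot drift upward a second time.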

Augmented Precision Square Roots, 2-D Norms, and Discussion on Correctly Rounding √(x²+y²)

Define an “augmented precision” algorithm as an algorithm that returns, in precision-p floating-point arithmetic, its result as the unevaluated sum of two floating-point numbers, with a relative error of the order of 2^(−2p). Assuming an FMA instruction is available, N. Brisebarre, M. Joldeş, P. Kornerup (Odense University, Denmark), É. Martin-Dorel and J.-M. Muller perform a tight error analysis of an augmented precision algorithm for the square root, and introduce two slightly different augmented precision algorithms for the 2-D norm √(x²+y²). Then they give tight lower bounds on the minimum distance (in ulps) between √(x²+y²) and a midpoint when √(x²+y²) is not itself a midpoint. This allows them to determine cases when their algorithms make it possible to return correctly-rounded 2-D norms [30].
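
A representative augmented precision square root, in the style analyzed in the paper, looks as follows (a sketch under the usual assumptions of binary64, round-to-nearest, correctly rounded sqrt and fma, and no underflow/overflow; not necessarily the exact algorithm of [30]):

    #include <math.h>

    /* Returns s1 = RN(sqrt(x)) and a correction s2 such that s1 + s2
       approximates sqrt(x) with a relative error of order 2^(-2p). */
    static void aug_sqrt(double x, double *s1, double *s2) {
        *s1 = sqrt(x);
        double rho = fma(-*s1, *s1, x);  /* x - s1^2, exact via the FMA */
        *s2 = rho / (2.0 * *s1);         /* first-order Newton correction */
    }

The residual x − s1² is known to be exactly representable when s1 = RN(√x), which is why a single FMA suffices to compute it.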

Midpoints and exact points of some algebraic functions in floating-point arithmetic

When implementing a function f in floating-point arithmetic, if we wish correct rounding and good performance, it is important to know whether there are input floating-point values x such that f(x) is either the middle of two consecutive floating-point numbers (assuming round-to-nearest arithmetic), or a floating-point number (assuming rounded toward ±∞ or toward 0 arithmetic). In the first case f(x) is a midpoint, and in the second case it is an exact point. In [20], C.-P. Jeannerod, N. Louvet, J.-M. Muller, and A. Panhaleux have studied whether such midpoints and exact points exist for some usual algebraic functions and various floating-point formats. When midpoints or exact points exist, they have been characterized or, when possible, listed exhaustively. The results and the techniques presented in that article can be used in particular to deal with both the binary and the decimal formats defined in the IEEE 754-2008 standard for floating-point arithmetic.
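
As an illustration of the kind of argument involved (our own example, in the spirit of [20]): in radix 2, √x can never be a midpoint. If √x equaled the midpoint (2M+1)·2^(q−1) between two consecutive precision-p numbers M·2^q and (M+1)·2^q, then x = (2M+1)²·2^(2q−2) would have an odd integer significand of more than p bits, so x could not be a precision-p number. A brute-force C check of this parity argument on a toy format:

    #include <stdio.h>

    /* Toy exhaustive check, precision p = 8, midpoints of one binade:
       squaring the midpoint (2M+1)*2^(q-1) gives the odd significand
       (2M+1)^2, which never fits in p bits, so the square is never a
       p-bit floating-point number. */
    int main(void) {
        const int p = 8;
        int found = 0;
        for (long M = 1L << (p - 1); M < (1L << p); M++) {
            long sq = (2 * M + 1) * (2 * M + 1);  /* odd integer */
            if (sq < (1L << p))    /* odd => representable iff < 2^p */
                found = 1;
        }
        printf(found ? "unexpected midpoint found\n"
                     : "no midpoints for square root\n");
        return 0;
    }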